Tech Arsenal 1

home *** CD-ROM | disk | FTP | other *** search

/ Tech Arsenal 1 / Tech Arsenal (Arsenal Computer).ISO / tek-03 / huffman2.zip / HUFFMAN2.DOC < prev

Wrap

Text File | 1992-05-29 | 4KB | 146 lines

HUFFMAN2.BAS and DECODER.BAS By Rich Geldreich 1992 The two QuickBASIC 4.5 programs you have just received can compress and decompress files with the Huffman compression algorithm. The first program HUFFMAN2.BAS, takes an input file and compresses it to OUTPUT.HUF. The second program, DECODER.BAS, takes OUTPUT.HUF and decompresses it a file specified via the command line. The two programs should be compiled for maximum speed. To compress EATME.DOC, use: huffman2 eatme.doc To decompress the compressed file to EATME2.DOC, use: decoder eatme2.doc Easy, right? I have released these two programs to show that Huffman compression isn't really that hard to implement. What is Huffman compression? It's a simple compression algorithm which takes advantage of a file's character distribution: if "A" is used more than "B", for instance, then the code representing "A" is significantly smaller than the code that represents "B". It's not the best, but it isn't that bad. Lets say we want to compress the string "ABCDDBBAAAADD". First, we must scan the string and tally up the count for each character: A is used 5 times B is used 3 times C is used 1 time D is used 4 times Now that we have this distribution, we must start making a binary tree that represents all of these characters; like this: (* = not used) 5 3 1 4 / \ / \ / \ / \ A * B * C * D * Now we must start combining "nodes" until there is only one node at the top. To do this we keep on combining 2 nodes which have the lowest frequency: 1 and 3 have the lowest frequency: 5 4 4 / \ / \ / \ A * B C D * 4 and 4 have the lowest frequency: 5 8 / \ / \ A * D 4 / \ B C And finally, 5 and 8 have the lowest frequency: 13 / \ A 8 / \ D 4 / \ B C To represent this tree, both programs use a linked list. To represent each node, the Father() array is used. To represent the left and right arrows LeftSon() and RightSon() arrays are used. No that we have the combined tree, we must travel down the tree and find out which bit sequence represents each character we wish to compress. (A recursive procedure is used to do this.) We start at the top, and every time we go left we send a "1" and every time we go right we send a "0". So "A" will be represented with "1", "B" with "001", "C" with "000", & "D" with "01". Notice that the character used most in the original string, "A", has the smallest code size, and the character used least, "C", has the biggest code size. A = 1 [used in 38% of the string] B = 001 [used in 23%] C = 000 [used in 7%] D = 01 [used in 31%] "ABCDDBBAAAADD" will be compressed as: 1 001 000 01 01 001 001 1 1 1 1 01 01 or 10010000 10100100 11111010 1 ...so the original 13 character sequence has been compressed to just 4 bytes. To decompress this 4 byte sequence, the decoder must have the encoding tree handy(which must be send separately). The encoding tree was: 13 / \ A 8 / \ D 4 / \ B C The output was: 10010000 10100100 11111010 1 Start at the top. The first bit is "1"- so go left. The first character is then "A". So far so good! Now we go back to the top and start all over... The second bit is "0"- so go right. We're now at "8". The third bit is "0"- so go right again. We're now at "4". The fourth bit is "1"- so go left. We're now at "B". "B" is the second character. Keep on repeating this until all of the characters have been decoded. That's all! Have fun. The two QB programs are in the public domain so do what you want with 'em. If you have any questions or cool ideas, I can be contacted at the following address: Rich Geldreich 410 Market St. Gloucester City, NJ 08030 (609)-742-8752 May 29th, 1992